Abstract: The problem of estimating the number of unique types or distinct
species in a group occurs in many fields, but is not a straightforward
one. Given an arbitrary probabilistic distribution of entries to be sam- 
pled, this study shows, in exact and approximated forms, the conditional 
probabilistic distribution of unique entries sampled independently given the 
number of samples. The numerical evaluation validates the proposed ap- 
proximation. Finally, we discuss a theory that unifies the related approaches 
regarding estimation of the number of classes. 
1. Introduction

Estimation of the number of unique types or distinct species in a group is a prob- 
lem found in many fields. For example, a linguist may study the vocabulary size 
of an author, an ecologist would be interested in species abundance in a region, 
and a computer scientist may investigate the number of attributes in a relational 
database. In an empirical estimation process, the number of potential types is 
unknown a priori and, thus, the sample size that is sufficiently large to contain 
all the potential types is also unknown. How can we count the number of types 
remaining unknown from only a known set of types? It is not straightforward to 
estimate the number of classes, such as word types or biological species, when 
most of the entries are rarely sampled. In such situations, a naive estimator, 
the observed number of unique entries, severely underestimates the number of 
potential types, including unobserved types not appearing in a given dataset. 

Consider the problem of estimating the number of unique words in a corpus 
by drawing one word from it at a time. We refer to the number of sampled 
words, or tokens, as the token size and the number of unique words, or types, 
as the type size. The basic problem of interest here is to estimate a potential 
type size $N$ given a token size $M$ in a situation where sampling a token of
a type at a particular time follows a probabilistic distribution, referred to as a
word distribution, $\{p_1, p_2, \ldots, p_i, \ldots, p_N\}$, where $p_i > 0$
($i = 1, 2, \ldots, N$) and $\sum_i p_i = 1$. Estimating the number of types, or
distinct biological species, is a central problem in the assessment of species
abundance in ecology [6, 7, 13, 17] or
lexical diversity in corpus linguistics [18, 22, 23, 24, 25, 26, 28, 30, 31]. Estimating
type size is not a trivial problem when the latent type size is large relative to 
a given token size or there are relatively many rarely observed types, which is 
often the case. We are likely to observe a limited number of types in a relatively 
small number of tokens when each word is drawn from a long-tail distribution. 
Indeed, many empirical studies have found the long-tail property in various 
empirical word distributions [15, 20, 31]. 

Most past studies have taken one of two distinct approaches [5]. The first is an
analysis based on the frequency spectrum, or the set of frequencies of each frequency.
Here, the "frequency of frequency", $f_k^M \ge 0$, denotes the number of types that
appear $k$ times in a given dataset of $M$ tokens [13, 14]. A typical problem is to
estimate $f_0^M$, which is the number of latent types appearing zero times, i.e.,
unobserved, based on the observed frequency spectrum
$\{f_1^M, f_2^M, \ldots, f_n^M\}$, where $n$ is the highest frequency of a type in
the dataset. In connection to this frequency spectrum approach, various kinds of
estimators and statistical models
have been proposed for finite and infinite population sizes [5]. 

The second approach, which is the primary focus of this study, is the so-called 
"curve fitting" approach [5]. In this line of thought, researchers have analyzed 
type-to-token curves in linguistics and species-to-area curves in ecology to eval- 
uate the number of distinct words or species, respectively [2, 4, 7, 16]. The 
number of types $K$ as a function of the number of tokens $M$ is generally
characterized as a monotonically increasing, convex-upward function. A typical problem
is to estimate the total number of types, including the set of unobserved types,
$\lim_{M \to \infty} K(M)$, from a given set of numbers of tokens
$\{m_1, m_2, \ldots, m_{n'}\}$ paired with the numbers of sampled types
$\{k_1(m_1), k_2(m_2), \ldots, k_{n'}(m_{n'})\}$. By definition,
$K(M) = \sum_{i=1}^{n} f_i^M$, which gives the relationship between the frequency
spectrum and the type-to-token curve. The type-to-token curve analyzes the
number of sampled types as a function of the number of tokens $M$, while the
frequency spectrum analyzes the distribution over frequency $f_i^M$ for a fixed $M$.

Typically, researchers have analyzed the type-token distribution and the fre- 
quency spectrum approaches separately. The type-to-token curve has often been 
analyzed in the context of lexical diversity [28] . One of the goals of these studies 
is to derive a substantially invariant indicator for lexical diversity as a func- 
tion of the number of tokens. A number of past attempts partially relied on 
some model of the number of types as a function of the number of tokens 
(for example, the logarithmic function [16]). However, these have been found 
to be inconstant across different token sizes [28]. More recent approaches in- 
clude parameter estimation based on regression of the type-token curve with 
the lower-order moments [2] or with approximated distributions [22] , as well as 
specific iterative methods [22, 25, 28]. Despite these various proposals, there are 
few acceptable measures of lexical diversity that are invariant in relation to the 
number of tokens [25, 28]. 

We believe that these past attempts to create a theoretical description of
type-token behavior have been unsuccessful, at least partially, because they lack a
principled characterization of the type-token distribution in a general form. Past
studies have presented the mean and variance of $K(M)$ under independent sampling
schemes [2, 3, 4, 12]. To our knowledge, however, there has been no research on a
rigorous probabilistic distribution for the type-to-token curve, except for its first 
and second moments. The theoretical foundation for the type-token relationship 
is yet to be established. This study establishes the mathematical groundwork 
for the probabilistic distribution of the number of types in a general form. 

1.1. Approach 

In this study, we show the probabilistic distribution of the number of types $K$
found in $M$ tokens when each word is drawn with replacement from an arbitrary word
distribution $\mathcal{N} = \{p_1, p_2, \ldots, p_N\}$ over $N$ types. We refer to the
probabilistic distribution of the number of sampled types $K$, given the number
of sampled tokens $M$ and the latent word distribution $\mathcal{N}$,
$P(K \mid M, \mathcal{N})$, as the type-token distribution. We show the type-token
distribution for an arbitrary word distribution $\mathcal{N}$ in an exact form and as
an asymptotic distribution in the limit of infinite token size, $M \to \infty$.
Placing certain assumptions on the word distribution $\mathcal{N}$ yields various
special cases that have been studied previously. Therefore, it gives a general
mathematical basis with which we can view the class of word distribution as a set of
parameters and optimize it for a given dataset. The latent number of types $N$, or
the number of types in the distribution with infinite samples, can be estimated by
fitting a model with a word distribution $\mathcal{N}$ as its parameters to given data
using the maximum likelihood approach (or the posterior distribution in Bayesian
statistics). From a Bayesian point of view, this provides a hierarchical model in
which a class of word distribution is drawn to provide a set of data of observed types.

The result of our derivation shows that, in general, the probabilistic distribution
$P(K \mid M, \mathcal{N})$ in its exact form is computationally costly when dealing with
large numbers of types $N$ and tokens $M$. Thus, we also address the problem
with an approximation to enable it to be calculated more efficiently. A close
approximation of $P(K \mid M, \mathcal{N})$ is given as a Poisson binomial distribution when
the token size M is sufficiently large. A numerical experiment is reported to 
evaluate the validity of the approximation. Finally, we discuss the role of the 
general model as an integrated sampling model unifying both the type-token 
and frequency spectrum analyses. 

2. Probabilistic distribution of the number of types given tokens 

Suppose that a word token is drawn from a corpus of types with a probabilistic
distribution $\mathcal{N} = \{p_1, p_2, \ldots, p_N\}$, where $p_i > 0$ and
$\sum_i p_i = 1$. Let $p_s = \sum_{i \in s} p_i$ be the probability of drawing a
token of a type from the type subset $s = \{s_1, s_2, \ldots, s_{|s|}\}$, where
$|s|$ denotes the size of the set $s$, and $p_\emptyset = 0$ for the empty set
$\emptyset$. Then, the probability of drawing a type subset $s$ with $K$ entries
given $M$ tokens, denoted by $P(|s| = K \mid M, \mathcal{N})$ or
$P(K \mid M, \mathcal{N})$, is given as follows:

$$P(K \mid M, \mathcal{N}) = \sum_{\{s : |s| = K\}} P(s \mid M, \mathcal{N}) \qquad (2.1)$$

$$= \sum_{k=1}^{K} (-1)^{K-k} \, {}_{N-k}C_{K-k} \sum_{\{s : |s| = k\}} p_s^M, \qquad (2.2)$$

where $\{s : |s| = k\}$ denotes all possible combinations of subsets $s$ with set
size $|s| = k$ drawn from the set of $N$ indices $\{1, 2, \ldots, N\}$, and
$P(s \mid M, \mathcal{N})$ is the probability that every type in the set of words
$s = \{s_1, s_2, \ldots, s_{|s|}\}$, and no other type, is found at least once in $M$
tokens drawn from the word distribution $\mathcal{N}$, which is determined as follows:

$$P(s \mid M, \mathcal{N}) = p_s^M - \sum_{i \in s} p_{s \setminus i}^M + \sum_{i<j \in s} p_{s \setminus \{i,j\}}^M - \sum_{i<j<k \in s} p_{s \setminus \{i,j,k\}}^M + \cdots + (-1)^{|s|-1} \sum_{i \in s} p_i^M. \qquad (2.3)$$
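
As an illustration only, Equation (2.2) can be evaluated by brute force for a small
number of types. The sketch below is not part of the original derivation; the function
name `exact_type_token_pmf` and the example distribution are assumptions made for
illustration, and the enumeration over all subsets costs on the order of $2^N$ operations.

```python
from itertools import combinations
from math import comb

def exact_type_token_pmf(p, M):
    """Exact P(K | M, p) for K = 0..N via Equation (2.2), assuming M >= 1.

    Hypothetical helper for illustration; enumerates every subset of types.
    """
    N = len(p)
    # power_sums[k] = sum over all subsets s with |s| = k of (p_s)^M
    power_sums = [0.0] * (N + 1)
    for k in range(1, N + 1):
        power_sums[k] = sum(sum(s) ** M for s in combinations(p, k))
    pmf = [0.0] * (N + 1)
    for K in range(1, N + 1):
        pmf[K] = sum((-1) ** (K - k) * comb(N - k, K - k) * power_sums[k]
                     for k in range(1, K + 1))
    return pmf

# Example with an assumed three-type word distribution and four tokens.
p = [0.5, 0.3, 0.2]
pmf = exact_type_token_pmf(p, M=4)
print(pmf, sum(pmf))
```

The probabilities sum to one because only the $k = N$ term survives the alternating
binomial sum, and entries with $K > M$ vanish, as stated at the start of the proof.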

Proof. Obviously, the probability $P(s \mid M, \mathcal{N})$ is zero for $|s| > M$. The
probability $P(s \mid M, \mathcal{N})$ for $|s| \le M$ can be written with the following
recursive equation, which can be viewed as an extension of the Chapman-Kolmogorov
equations [2]:

$$P(s \mid M, \mathcal{N}) = P(s \mid M-1, \mathcal{N}) \, p_s + \sum_{i \in s} P(s \setminus i \mid M-1, \mathcal{N}) \, p_i, \qquad (2.4)$$

where the right hand side consists of the sum of the probabilities in the two cases:
either drawing a token of the types in $s$ (with probability $p_s$) when the type
subset $s$ has already been drawn in $(M-1)$ tokens ($P(s \mid M-1, \mathcal{N})$), or
drawing a word of the new type $i$ (with probability $p_i$ for $i \in s$) when the type
subset $s$, excluding $i$, has already been drawn in $(M-1)$ tokens
($P(s \setminus i \mid M-1, \mathcal{N})$). Using the recursive Equation (2.4), we prove
that the right hand side of Equation (2.3) is a polynomial function of $p_s$, assuming
$P(s \mid M-1, \mathcal{N})$ in the mathematical induction below. In the special case of
$m = 1$, $P(i \mid 1, \mathcal{N}) = p_i$ satisfies Equation (2.3) trivially. Let us
assume that Equation (2.3) is proven for $P(s \mid m, \mathcal{N})$ ($m \le M-1$). Then,
we prove $P(s \mid M, \mathcal{N})$ for $s = \{s_1, s_2, \ldots, s_{|s|}\}$,
based on the previous step $P(s \mid M-1, \mathcal{N})$ for $M > 1$. Combining the
definitions of $P(s \mid m, \mathcal{N})$ for $m \le M-1$ in Equations (2.3) and (2.4),
we get:
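$$P(s \mid M, \mathcal{N}) = p_s P(s \mid M-1, \mathcal{N}) + \sum_{i \in s} p_i P(s \setminus i \mid M-1, \mathcal{N})$$

$$= p_s \Big( p_s^{M-1} - \sum_{j \in s} p_{s \setminus j}^{M-1} + \sum_{i<j \in s} p_{s \setminus \{i,j\}}^{M-1} - \sum_{i<j<k \in s} \cdots \Big) + \sum_{i \in s} p_i \Big( p_{s \setminus i}^{M-1} - \sum_{j \in s \setminus i} p_{s \setminus \{i,j\}}^{M-1} + \sum_{j<k \in s \setminus i} \cdots \Big) \qquad (2.5)$$

$$= p_s^M - \sum_{i \in s} p_{s \setminus i} \Big( p_{s \setminus i}^{M-1} - \sum_{j \in s \setminus i} p_{s \setminus \{i,j\}}^{M-1} + \sum_{j<k \in s \setminus i} \cdots \Big) \qquad (2.6)$$

$$= p_s^M - \sum_{i \in s} p_{s \setminus i} P(s \setminus i \mid M-1, \mathcal{N} \setminus i). \qquad (2.7)$$

Note that Equation (2.6) is derived via the following identity:

$$\sum p_{s \setminus \{i_1, i_2, \ldots, i_n\}} \, p_{s \setminus \{i_1, i_2, \ldots, i_n\}}^{M-1} = p_s \sum p_{s \setminus \{i_1, i_2, \ldots, i_n\}}^{M-1} - \sum p_{\{i_1, i_2, \ldots, i_n\}} \, p_{s \setminus \{i_1, i_2, \ldots, i_n\}}^{M-1}.$$

We now apply the recursive Equation (2.7):

$$P(s \mid M, \mathcal{N}) = p_s^M - \sum_{i \in s} p_{s \setminus i} P(s \setminus i \mid M-1, \mathcal{N} \setminus i)$$

$$= p_s^M - \sum_{i \in s} p_{s \setminus i}^M + \sum_{i \in s} p_{s \setminus i} \sum_{j \in s \setminus i} p_{s \setminus \{i,j\}} P(s \setminus \{i,j\} \mid M-2, \mathcal{N} \setminus \{i,j\})$$

$$\vdots$$

$$= p_s^M - \sum_{i \in s} p_{s \setminus i}^M + \sum_{i<j \in s} p_{s \setminus \{i,j\}}^M - \sum_{i<j<k \in s} p_{s \setminus \{i,j,k\}}^M + \cdots + (-1)^{|s|-1} \sum_{i \in s} p_i^M,$$

where $P(\emptyset \mid M - |s|, \mathcal{N} \setminus s) = 0$.

Thus, the probabilistic distribution of a specific set $P(s \mid M, \mathcal{N})$
(Equation 2.3) has been proved for $M > 1$. Then, Equation 2.1 is trivial. □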

2.1. Moment generating function

Next, we describe the moment-generating function for an arbitrary order of moments
of the type-token distribution. The moment-generating function of the
type-token distribution (Equation 2.2), defined by
$M_P(t) = \sum_{K=1}^{N} \exp(Kt) P(K \mid M, \mathcal{N})$, is as follows:

$$M_P(t) = \sum_{k=1}^{N} \sum_{s : |s| = k} p_s^M \exp(kt) (1 - \exp(t))^{N-k}. \qquad (2.8)$$

The asymptotic probabilistic distribution of $P(K \mid M, \mathcal{N})$ in the limit
$M \to \infty$ is shown using the moment-generating function $M_P(t)$ (Equation 2.8) in
Section 2.3.

Proof. According to Equation 2.2:

$$M_P(t) = \sum_{K=1}^{N} \sum_{k=1}^{K} (-1)^{K-k} \, {}_{N-k}C_{K-k} \sum_{s : |s| = k} p_s^M \exp(Kt)$$

$$= \sum_{K=1}^{N} \sum_{k=1}^{N} \delta(K-k) (-1)^{K-k} \, {}_{N-k}C_{K-k} \sum_{s : |s| = k} p_s^M \exp(Kt)$$

$$= \sum_{k=1}^{N} \sum_{s : |s| = k} \sum_{K=k}^{N} (-1)^{K-k} \, {}_{N-k}C_{K-k} \, p_s^M \exp(Kt)$$

$$= \sum_{k=1}^{N} \sum_{s : |s| = k} p_s^M \exp(kt) \sum_{K'=0}^{N'} (-1)^{K'} \, {}_{N'}C_{K'} \exp(K't),$$

where $\delta(x)$ is the Heaviside function, equal to 1 if $x \ge 0$ and 0 otherwise;
$K' = K - k$; and $N' = N - k$. Using the binomial theorem
$(p + q)^n = \sum_{k=0}^{n} p^k q^{n-k} \, {}_{n}C_{k}$, we obtain the moment-generating
function (Equation 2.8) with
$\sum_{K'=0}^{N'} (-\exp(t))^{K'} \, {}_{N'}C_{K'} = (1 - \exp(t))^{N'}$. □



2.1.1. The first and second order moments

The moment-generating function gives the $n$th-order moment as
$\left. \frac{\partial^n M_P(t)}{\partial t^n} \right|_{t=0}$.
Consistent with past research on the first-order moment [2] (with the parameter
for "blank tokens" $\beta = 0$) and the second-order moment [3, 4, 12], the
moment-generating function (Equation 2.8) gives the mean number of types
$E[K \mid M, \mathcal{N}]$ in $M$ tokens as follows:

$$E[K \mid M, \mathcal{N}] = \sum_{i} \left( 1 - (1 - p_i)^M \right). \qquad (2.9)$$

Likewise, the second-order moment of the number of types
$E[K^2 \mid M, \mathcal{N}]$ is given by:

$$E[K^2 \mid M, \mathcal{N}] = E[K \mid M, \mathcal{N}] + \sum_{i} \sum_{j \ne i} \left\{ 1 + (1 - p_{\{i,j\}})^M - (1 - p_i)^M - (1 - p_j)^M \right\}. \qquad (2.10)$$
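
A minimal sketch of how Equations (2.9) and (2.10) can be turned into a mean and a
variance follows. The function name `type_count_mean_var` is an assumption introduced
here for illustration; the double loop costs $O(N^2)$ operations.

```python
def type_count_mean_var(p, M):
    """Mean and variance of the number of types K, from Equations (2.9) and (2.10).

    Hypothetical helper for illustration; p is a list of type probabilities.
    """
    # miss[i] = probability that type i is never drawn in M tokens
    miss = [(1.0 - pi) ** M for pi in p]
    mean = sum(1.0 - m for m in miss)
    second_moment = mean
    n_types = len(p)
    for i in range(n_types):
        for j in range(n_types):
            if i != j:
                # probability that both type i and type j are observed
                second_moment += 1.0 + (1.0 - p[i] - p[j]) ** M - miss[i] - miss[j]
    return mean, second_moment - mean ** 2

# Two equiprobable types, two tokens: K is 1 or 2 with probability 1/2 each.
print(type_count_mean_var([0.5, 0.5], 2))
```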



2.2. Special case 

In the special case of the uniform distribution $p_i = \frac{1}{N}$ for
$i = 1, 2, \ldots, N$, the probability of $K$ types sampled in $M$ tokens,
$P(K \mid M, \mathcal{N}) = \sum_{\{s : |s| = K\}} P(s \mid M)$,
is given by:

$$P(K \mid M, \mathcal{N}) = \frac{N!}{(N-K)! \, N^M} S(M, K), \qquad (2.11)$$

where:

$$S(M, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^i \, {}_{K}C_{i} \, (K - i)^M \qquad (2.12)$$

is called the Stirling number of the second kind [1].
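
For a concrete check of the uniform special case, the sketch below evaluates
Equations (2.11) and (2.12) with exact rational arithmetic. The function names
`stirling2` and `uniform_type_token_pmf` are assumptions introduced for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def stirling2(M, K):
    """Stirling number of the second kind S(M, K) via Equation (2.12)."""
    total = sum((-1) ** i * comb(K, i) * (K - i) ** M for i in range(K + 1))
    return total // factorial(K)  # the alternating sum is divisible by K!

def uniform_type_token_pmf(N, M):
    """P(K | M) for K = 0..N under p_i = 1/N, Equation (2.11), assuming M >= 1."""
    return [Fraction(factorial(N) * stirling2(M, K), factorial(N - K) * N ** M)
            for K in range(N + 1)]

# Five equiprobable types, six tokens.
pmf = uniform_type_token_pmf(5, 6)
print([str(x) for x in pmf], sum(pmf))
```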



2.3. Asymptotic distribution 

The exact computation of the type-token distribution $P(K \mid M, \mathcal{N})$
(Equation 2.2) is an intractable problem for large $M$ and $N$ due to the
computational cost of the sum over combinations $\sum_{\{s : |s| = k\}} p_s^M$. Thus,
it is of practical importance to provide a reasonable approximation of the
probabilistic distribution to allow for efficient computation. In this section, we
show that, for a large token size $M \to \infty$, the asymptotic probabilistic
distribution of types $P(K \mid M, \mathcal{N})$ is a Poisson binomial distribution
$Q(K \mid M, \mathcal{N})$ with a set of parameters as follows:

$$P(K \mid M, \mathcal{N}) \approx Q(K \mid M, \mathcal{N}). \qquad (2.13)$$

The Poisson binomial distribution [8, 9, 27, 29] is given by:

$$Q(K \mid M, \mathcal{N}) = \sum_{\{s : |s| = K\}} \prod_{i \in s} q_{iM} \prod_{j \in \mathcal{N} \setminus s} (1 - q_{jM}), \qquad (2.14)$$

where $q_{iM} = 1 - (1 - p_i)^M$. Most importantly, the Poisson binomial distribution
$Q(K \mid M, \mathcal{N})$ can be computed efficiently with recursive equations
[8, 11, 27] (there is a further approximation of $Q(K \mid M, \mathcal{N})$ that
provides an even more efficient computation method at the cost of approximation
accuracy [21]).
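
One way to evaluate Equation (2.14) without enumerating subsets is a simple dynamic
program that adds one Bernoulli trial at a time. This is a generic textbook recursion
offered as a sketch; it is not claimed to be the specific recursion of [8, 11, 27],
and the function names are assumptions introduced for illustration. It runs in
$O(N^2)$ time.

```python
def poisson_binomial_pmf(q):
    """PMF of the number of successes among independent Bernoulli(q_i) trials."""
    pmf = [1.0]  # with zero trials, K = 0 has probability 1
    for qi in q:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - qi)  # this type is not observed
            nxt[k + 1] += mass * qi      # this type is observed at least once
        pmf = nxt
    return pmf

def approx_type_token_pmf(p, M):
    """Q(K | M, p) of Equation (2.14), with q_iM = 1 - (1 - p_i)^M."""
    return poisson_binomial_pmf([1.0 - (1.0 - pi) ** M for pi in p])

# Assumed three-type word distribution and four tokens.
print(approx_type_token_pmf([0.5, 0.3, 0.2], 4))
```

Note that, unlike the exact distribution, this approximation assigns a small positive
probability to $K = 0$ and to $K > M$, which vanishes as $M$ grows.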

Proof. Let
$M_Q(t) = \sum_{K=0}^{N} \exp(Kt) \sum_{s : |s| = K} \prod_{i \in s} q_{iM} \prod_{j \in \mathcal{N} \setminus s} (1 - q_{jM})$
be the moment-generating function of the Poisson binomial distribution
$Q(K \mid M, \mathcal{N})$. In order to prove the asymptotic relationship
(Equation 2.13), we utilize a theoretical property of moment-generating functions,
i.e., that two probabilistic distributions with identical moment-generating functions
are identical [19]. Specifically,
$M_P(t) = M_Q(t) \Rightarrow P(K \mid M, \mathcal{N}) = Q(K \mid M, \mathcal{N})$.
The moment-generating function $M_Q(t)$ is given by [29]:

$$M_Q(t) = \prod_{k=1}^{N} (1 - q_{kM} + q_{kM} \exp(t)). \qquad (2.15)$$

We now show that the moment-generating function of the type-token distribution
(Equation 2.8) approaches Equation 2.15 in the limit $M \to \infty$.
Expanding the moment-generating function (Equation 2.8) gives:

$$M_P(t) = e^{tN} + e^{t(N-1)} (1 - e^t) \sum_{i \in \mathcal{N}} (1 - p_i)^M + e^{t(N-2)} (1 - e^t)^2 \sum_{i<j \in \mathcal{N}} (1 - p_{ij})^M + \cdots$$

and expanding Equation 2.15:

$$M_Q(t) = \prod_{k=1}^{N} \left( (1 - p_k)^M + (1 - (1 - p_k)^M) e^t \right) = \prod_{k=1}^{N} \left( e^t + (1 - e^t)(1 - p_k)^M \right)$$

gives:

$$M_Q(t) = e^{tN} + e^{t(N-1)} (1 - e^t) \sum_{i \in \mathcal{N}} (1 - p_i)^M + e^{t(N-2)} (1 - e^t)^2 \sum_{i<j \in \mathcal{N}} (1 - p_i)^M (1 - p_j)^M + \cdots$$

The difference between the two moment-generating functions is calculated by:

$$M_P(t) - M_Q(t) = e^{t(N-2)} (1 - e^t)^2 \sum_{i<j} \Delta_{ij}^M + e^{t(N-3)} (1 - e^t)^3 \sum_{i<j<k} \Delta_{ijk}^M + \cdots$$

where:

$$-(1 - p_i)^M (1 - p_j)^M \le \Delta_{ij}^M = (1 - p_{ij})^M - (1 - p_i)^M (1 - p_j)^M \le 0,$$

and

$$-(1 - p_i)^M (1 - p_j)^M (1 - p_k)^M \le \Delta_{ijk}^M = (1 - p_{ijk})^M - (1 - p_i)^M (1 - p_j)^M (1 - p_k)^M \le 0,$$

and so on. Then:

$$\lim_{M \to \infty} \Delta_{ij}^M \to 0, \quad \lim_{M \to \infty} \Delta_{ijk}^M \to 0, \quad \ldots$$

and, therefore, with a sufficiently large token size $M \to \infty$:

$$\lim_{M \to \infty} M_P(t) \to M_Q(t).$$

This shows that the Poisson binomial distribution $Q(K \mid M, \mathcal{N})$ is a
reasonable asymptotic approximation (Equation 2.13) when $M$ is large. □



2.4. A criterion to stop sampling



In an empirical estimation process, the number of latent types is unknown a 
priori and, thus, it is also unknown how many tokens are enough to estimate 
the number of types with a certain level of accuracy. In some situations with a 
particular cost for sampling, it is of practical importance to develop a criterion 
to stop the sampling process. In general, a larger token size gives a more accurate 
estimator, but is also more costly in terms of empirical data collection. Thus, a 
basic strategy is to evaluate how informative an additional sample is relative to 
its cost, and to stop sampling when there is an appropriate balance between the
accuracy of an estimator and the cost of data collection. The lower bound
for the variance of a maximum likelihood estimator is given by the inverse of
the Fisher information. Therefore, we propose to keep sampling until the token
size $M$ is large enough to satisfy
$\mathcal{I}(|\mathcal{N}|)^{-1} < \mathcal{I}^{-1}$, where $\mathcal{I}$ is a desired
bound for the empirical Fisher information $\mathcal{I}(|\mathcal{N}|)$ of a given
dataset. The bound $\mathcal{I}$ is chosen empirically according to the cost of
data collection.

Given $n$ pairs of numbers of tokens and types $\{m_i, n_i\}$ ($i = 1, 2, \ldots, n$),
the likelihood function of the parameter $\mathcal{N}$ is given by
$L(\mathcal{N}) = \prod_{i=1}^{n} Q(n_i \mid m_i, \mathcal{N})$
using the Poisson binomial approximation (Equation 2.14). We next obtain an
estimator $\hat{\mathcal{N}}$ by maximizing the likelihood function $L(\mathcal{N})$.
For independent samples of $\{m_i, n_i\}_{i=1}^{n}$ from the asymptotic distribution
(Equation 2.14), the Fisher information $\mathcal{I}(|\mathcal{N}|)$ is given as
follows:

$$\mathcal{I}(|\mathcal{N}|) = \sum_{j=1}^{|\mathcal{N}|} \sum_{i=1}^{n} \hat{q}_{j m_i}^{-1} (1 - \hat{q}_{j m_i})^{-1}, \qquad (2.16)$$

where $\hat{q}_{j m_i} = 1 - (1 - \hat{p}_j)^{m_i}$ is the maximum likelihood estimator
of the parameter. In general, $\mathcal{I}(|\mathcal{N}|)^{-1}$ is a monotonically
decreasing function of $n$, and
$\lim_{\max_i m_i \to \infty} \mathcal{I}(|\mathcal{N}|)^{-1} \to 0$. This suggests
having either a large sample size $n$ or a large token size $m_i$ in order to satisfy
a given criterion $\mathcal{I}(|\mathcal{N}|)^{-1} < \mathcal{I}^{-1}$.
Equation (2.16) implies a general idea about which kind of word distribution
$\{p_i\}_i$ needs a larger sample size. For example, with a Zipf distribution
$p_i \propto i^{-a}$ ($a > 0$, $i = 1, 2, \ldots, N$), a less skewed word distribution
with the exponent $a_1$ needs a relatively smaller sample size than that with the
exponent $a_2$, where $a_2 > a_1$, given the identical number of types $N$ and the
number of tokens $M$.

3. Numerical Experiments 

First, we numerically validated the theoretical results against the corresponding
values in numerical simulations. Figure 1 shows the average and standard
deviation of the number of types as a function of the number of tokens
when the latent word distribution follows the Zipf distribution $p_k \propto k^{-1}$
for $k = 1, 2, \ldots, 500$. These results indicate that the theoretical means and
standard deviations of types fit the samples well. Next, we estimated the
number of latent types when sampling words following a Zipf distribution. In
the simulation manipulating the number of tokens, we performed a sensitivity
analysis regarding small token size on the maximum likelihood estimator of the
number of types based on the Poisson binomial approximation. Given $n$ pairs of
numbers of tokens and types $\{m_i, n_i\}$ ($i = 1, 2, \ldots, n$), the likelihood
function of the parameters $a > 0$ and $|\mathcal{N}| > 0$ is given as
$L(a, |\mathcal{N}|) = \prod_{i=1}^{n} Q(n_i \mid m_i, \mathcal{N}, a)$
by the Poisson binomial approximation (Equation 2.14) with word distribution
$p(k \mid \mathcal{N}, a) \propto k^{-a}$ for $k = 1, 2, \ldots, |\mathcal{N}|$. We
obtained the estimators of the exponent $a$ and the number of types $|\mathcal{N}|$ by
maximizing the logarithm of the likelihood $\log L(a, |\mathcal{N}|)$ as a function of
the parameters.

We performed two Monte Carlo simulations. In the first simulation, we
independently drew 1000, 1500, and 2000 tokens of words from a corpus of 1000
types following the true Zipf distribution $p_k \propto k^{-a}$, where the exponent
$a = 1$. In the second simulation, we drew 2000 tokens of words from a corpus of 1000
types following a Zipf distribution with the exponents $a = 0, 0.5, 1$. The results
of the two simulations are shown in Figures 2(a) and 2(b), respectively. Given
the number of sampled types, the maximum likelihood inference gave estimators
close to the true number of types, i.e., 1000. Thus, these simulations support
the use of the maximum likelihood estimator based on the Poisson binomial
approximation.
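
The kind of check described above can be sketched as follows: draw tokens from a Zipf
word distribution with exponent 1 over 500 types and compare the simulated mean number
of types with Equation (2.9). This is an illustrative sketch, not the authors'
simulation code; the function name, the seed, the number of runs, and the token size
are assumptions chosen here.

```python
import random

def simulate_type_counts(p, M, runs, seed=0):
    """Number of distinct types observed in M draws, repeated `runs` times."""
    rng = random.Random(seed)
    types = range(len(p))
    return [len(set(rng.choices(types, weights=p, k=M))) for _ in range(runs)]

# Zipf word distribution p_k proportional to 1/k over N = 500 types.
N, M = 500, 1000
weights = [1.0 / k for k in range(1, N + 1)]
total = sum(weights)
p = [w / total for w in weights]

counts = simulate_type_counts(p, M, runs=200)
empirical_mean = sum(counts) / len(counts)
theoretical_mean = sum(1.0 - (1.0 - pi) ** M for pi in p)  # Equation (2.9)
print(empirical_mean, theoretical_mean)
```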

4. Future Work 

The derived type-token distribution $P(K \mid M, \mathcal{N})$ is not limited to a
particular word distribution (e.g., the Zipf distribution as demonstrated in
Section 3) and is applicable to any word distribution. Thus, a potential direction
for future research is the modeling and estimation of the class of the word
distribution $\mathcal{N}$ as a set of parameters. Although the maximum number of
degrees of freedom in $\mathcal{N}$ is $N - 1$, there would be too many parameters,
which would cause an overfitting problem (i.e., inaccuracy in the parameter
estimation) for a large number of potential words $N$ and a relatively small token
size. Thus, modeling an unknown word distribution $\mathcal{N}$ needs to be flexible,
yet economical. A potential solution is the nonparametric Bayesian approach, which
models an unknown word distribution $\mathcal{N}$ as a Dirichlet process in which a
varying number of type probabilities $p_i$ can be drawn [10].

Another suggestion for future studies is to develop the current theory to in- 
tegrate frequency spectrum methods, where the frequency spectrum is another 
way of characterizing the number of latent types [5, 13, 14], giving the "fre- 
quency of frequency", defined in Section 1. Since we also consider a varying 
token size $M$ in the type-token approach, let $f_i^M$ denote the frequency
spectrum of frequency $i$ in $M$ tokens. By definition, the number of types counted
by $P(K \mid M, \mathcal{N})$ is $K = \sum_{i=1}^{n} f_i^M$. The
type-token distribution analyzes the number of observed types as a function of
token size $M$, while the frequency spectrum analyzes the distribution over frequency
$f_i^M$ for a fixed $M$. Although these methods have generally been analyzed
separately, they can be written as two special cases in an extended form as follows.
Suppose $P(s_1, s_2, \ldots, s_n \mid M, \mathcal{N})$ is the probability that the
type $i \in s_j$, which is drawn by chance with probability $p_i$, appears $j$ times
in $M$ tokens. Then, the probability of the $k$th frequency spectrum $f_k^M$ being $K$
is as follows:

$$P(f_k^M = K \mid M, \mathcal{N}) = \sum_{s_k : |s_k| = K} P(s_1, \ldots, s_k, \ldots, s_n \mid M, \mathcal{N}), \qquad (4.1)$$

where the sum is over all combinations of $\{s_1, s_2, \ldots, s_n\}$ satisfying
$|s_k| = K$. Thus, the probabilistic mass function of
$P(f_k^M = K \mid M, \mathcal{N})$ is given by
$P(s_1, \ldots, s_k, \ldots, s_n \mid M, \mathcal{N})$. Let $s_k^+$ be the set of types
with frequency equal to $k$ or larger. Then, Equation (2.1) is identical to
$P(s \mid M, \mathcal{N}) = P(s_1^+ \mid M, \mathcal{N})$ as a special case.

Fig 1. (a) The average number of types as a function of the number of tokens and (b)
the standard deviation of types as a function of the average (axis label: Average
Sampled Types) when sampling five hundred words following a power distribution.

Fig 2. The estimated number of types for the data drawn from the power distribution (a)
as a function of the number of tokens (M = 1000, 1500, 2000) and (b) as a function of
the exponent (a = 0, 0.5, 1). The averages and standard deviations of the estimators
for the hundred datasets are shown.

Moreover, extending Equation (2.4), we obtain the following recursive equation
for $P(s_1, s_2 \mid M, \mathcal{N})$:

$$P(s_1, s_2 \mid M, \mathcal{N}) = \sum_{i \in s_1} p_i P(s_1 \setminus i, s_2 \mid M-1, \mathcal{N}) + \sum_{i \in s_2} p_i P(\{s_1, i\}, s_2 \setminus i \mid M-1, \mathcal{N}) + p_{s_2} P(s_1, s_2 \mid M-1, \mathcal{N}).$$

This recursive equation is easily extended to
$P(s_1, \ldots, s_n \mid M, \mathcal{N})$. Thus, the solution of these extended
recursive equations gives the desired probabilistic distribution (Equation 4.1). We
expect to report the computable form, either rigorous or approximated, of
Equation (4.1) in future work.

5. Conclusion 

This study explored the probabilistic distribution $P(K \mid M, \mathcal{N})$ of the
number of types $K$ given a number of tokens $M$ for an arbitrary word distribution
$\mathcal{N}$
and its theoretical properties, including its moment-generating function. We 
also showed that a Poisson binomial distribution approximates the probabilistic 
distribution asymptotically for a larger number of tokens. The numerical ex- 
periments validated the maximum likelihood inference based on the asymptotic 
Poisson binomial distribution as a reasonable estimator of the number of types 
when a dataset follows a Zipf distribution. Moreover, we derived a Fisher infor- 
mation matrix for the asymptotic distribution and proposed its use for designing 
a word sampling procedure. 

Finally, we mentioned potential future work on the basis of the present work. 
We suggested the nonparametric Bayesian approach for modeling an unknown
word distribution. We also outlined a way to integrate frequency spectrum
methods, which will be developed further in future work. 